What is modeling?
Can mean 3 things:
Error & cost:
Features:
Tables
Structured Data
Unstructured Data
Common types of structured data
Applied to credit loan problem:
m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
x_i1 = credit score of person i
x_i2 = income of person i
y_i = response for data point i
y_i = 1 if data point i is blue, -1 if data point i is red
line: (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0
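The classifier line above can be sketched in a few lines of Python: a point is classified by the sign of a_1*x_1 + ... + a_n*x_n + a_0. The coefficient values below are made-up illustration numbers, not fitted ones.

```python
def classify(x, a, a0):
    """Return 1 (blue) if the point is on the positive side of the line, else -1 (red)."""
    score = sum(ai * xi for ai, xi in zip(a, x)) + a0
    return 1 if score >= 0 else -1

# Example: credit score and income as the two attributes (hypothetical weights)
a = [0.5, 0.3]
a0 = -400.0
print(classify([700, 300], a, a0))  # 0.5*700 + 0.3*300 - 400 = 40  -> 1
print(classify([500, 100], a, a0))  # 0.5*500 + 0.3*100 - 400 = -120 -> -1
```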
Notes:
Scaling linearly:
Scaling to a normal distribution:
Which method to use? Depends:
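The two scaling approaches above can be sketched as follows (a minimal Python illustration; the course examples use R):

```python
def scale_linear(xs):
    """Min-max scaling: map values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def scale_normal(xs):
    """Standardization: subtract the mean, divide by the standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

data = [10.0, 20.0, 30.0, 40.0]
print(scale_linear(data))   # endpoints map to 0 and 1
print(scale_normal(data))   # mean 0, standard deviation 1 after scaling
```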
Idea:
Things to keep in mind:
Data has 2 types of patterns:
"Fitting" matches both:
How to measure a model's performance:
What if we want to compare 5 runs of SVM and KNN?
Problems:
Solution:
How much data to use?
How to split data?
Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.
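The k-fold idea above can be sketched as a loop. The "model" here is a stand-in (predict the training mean) just to show the split/train/validate structure, not a real learner.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

ys = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
for train, val in kfold_indices(len(ys), k=3):
    mean = sum(ys[i] for i in train) / len(train)            # "train" the stand-in model
    mse = sum((ys[i] - mean) ** 2 for i in val) / len(val)   # evaluate on the held-out fold
    print(train, val, round(mse, 3))
```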
Each part is used k-1 times for training, and 1 time for validation. A common choice of k is 10.
After all k splits, train the model one final time using all the data.
Definition: Grouping data points
Use:
Distances are measured with p-norms in n dimensions.
The infinity norm is the limit as p is set to infinity.
Why use infinity norms?
How it works:
Start with k cluster centers placed within the range of the data.
Characteristics:
heuristic: Fast, good, but not guaranteed to find the absolute best solution.
This applies to the choice of k as well: treat k as a parameter for optimization.
k == # of data points may be the most theoretically optimal, but does that actually make sense for the task?
Plot k vs. total distance to find the "elbow" of the curve. At a certain number, the benefit of adding another k becomes negligible.
Classification task: Given a new data point, determine which cluster the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.
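The nearest-centroid classification step described above can be sketched as follows; the centroid coordinates are hypothetical results from a k=3 clustering run.

```python
def nearest_centroid(point, centroids):
    """Return the index of the closest centroid (squared Euclidean distance)."""
    def dist2(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return min(range(len(centroids)), key=lambda k: dist2(point, centroids[k]))

# Hypothetical centroids from a k=3 clustering run
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(nearest_centroid((4.0, 4.5), centroids))  # -> 1 (closest to (5, 5))
print(nearest_centroid((9.0, 1.0), centroids))  # -> 2 (closest to (10, 0))
```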
Another classification task: What range of possible data points would we assign to each cluster?
Image of cluster region, aka "Voronoi diagram"
Data Preparation: Quantitative Examples
Things to watch out for before building models:
Definition: Data point that's very different from the rest.
Depends on the data:
Actions:
Definition: Determining whether something has changed.
Why:
Definition: Detect an increase/decrease (or both) via the cumulative sum: "Has the cumulative sum gone beyond a critical level?"
How:
Terms:
Formula to detect increase (True if $S_t \ge T$)
$$ S_t = \max{\left\{0, S_{t-1} + (X_t-\mu-C) \right\}} $$
Formula to detect decrease (True if $S_t \ge T$); $\mu$ and $X_t$ are flipped:
$$ S_t = \max{\left\{0, S_{t-1} + (\mu-X_t-C) \right\}} $$
Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time; if it ever goes beyond the threshold line, CUSUM has detected a change.
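The increase detector can be sketched directly from its recurrence, S_t = max(0, S_{t-1} + (x_t - mu - C)), flagging a change once S_t >= T. The values of mu, C, and T below are illustrative, not tuned.

```python
def cusum_increase(xs, mu, C, T):
    """Return the index where an increase is first detected, or None."""
    S = 0.0
    for t, x in enumerate(xs):
        S = max(0.0, S + (x - mu - C))
        if S >= T:
            return t
    return None

data = [10, 11, 9, 10, 10, 14, 15, 16, 15]  # the mean shifts upward near the end
print(cusum_increase(data, mu=10.0, C=1.0, T=5.0))  # detects at index 6
```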
Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.
Example:
Time series complexities
Trends
Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$
Trend at time $t$ based on the delta between observed values $S_t$ and $S_{t-1}$, with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$
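The two trend-adjusted (Holt) equations above can be sketched together; the alpha/beta values and the simple starting conditions (S_0 = first observation, T_0 = 0) are illustrative choices.

```python
def holt_smooth(xs, alpha, beta):
    """Return the lists of baseline (S) and trend (T) estimates."""
    S, T = [xs[0]], [0.0]  # simple starting conditions
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * (S[-1] + T[-1])   # S_t
        t = beta * (s - S[-1]) + (1 - beta) * T[-1]     # T_t
        S.append(s)
        T.append(t)
    return S, T

S, T = holt_smooth([10.0, 12.0, 14.0, 16.0], alpha=0.5, beta=0.5)
print([round(s, 3) for s in S])
print(f"one-step forecast: {S[-1] + T[-1]:.3f}")  # F_(t+1) = S_t + T_t
```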
Cyclic
Two ways to calculate:
Baseline formula (including trend + seasonality)
$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$
Update the seasonal (cyclic) factor in a similar way as trends:
$$C_t = \gamma\left(\frac{x_t}{S_t}\right) + (1 - \gamma)C_{t-L}$$
Example: Sales trends
Starting Conditions
For trend:
For multiplicative seasonality
"Smoothing"
Graph of what it looks like:
"Exponential"
Each time period estimate can be plugged in like this:
Given basic exponential smoothing equation
$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$
We want to make a forecast $F_{t+1}$. Since $x_{t+1}$ is unknown, replace it with $S_t$.
Using $S_t$, the forecast for time period $t+1$ is
$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$
Hence, our forecast is the same as our latest baseline estimate:
$$F_{t+1} = S_t$$
Factoring in trend/cycle
The above equation can be extrapolated to the trend/cycle calculations.
Best estimate of trend is the most current trend estimate:
$$F_{t+1} = S_t + T_t$$
Same for cycle (multiplicative seasonality):
$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$
Where $F_{t+k} = (S_t + kT_t)C_{(t+1)-L+(k-1)}$, $k=1,2,\ldots$
3 key parts
1. Differences
For example:
$$D_{(1)t} = x_t - x_{t-1}$$
$$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$
$$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$
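The differencing formulas above amount to repeatedly taking adjacent differences; a quadratic series illustrates that second differences become constant.

```python
def difference(xs, order=1):
    """Apply first-order differencing `order` times."""
    for _ in range(order):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

x = [1.0, 4.0, 9.0, 16.0, 25.0]   # quadratic series (squares)
print(difference(x, 1))  # [3.0, 5.0, 7.0, 9.0]  first differences
print(difference(x, 2))  # [2.0, 2.0, 2.0]       second differences are constant
```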
2. Autoregression
Definition: Predicting the current value based on previous time periods' values.
Autoregression's exponential smoothing:
Order-p autoregressive model:
"ARIMA" combines autoregression and differencing
3. Moving Average
ARIMA (p,d,q) model
$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)t-i} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$
Choose:
Other flavors of ARIMA
Definition: Estimate or forecast the variance of something, given a time-series data.
Motivation:
Mathematical Model:
$$ \sigma_t^2 = \omega + \sum_{i=1}^p\beta_i\sigma_{t-i}^2 + \sum_{i=1}^q \gamma_i\epsilon_{t-i}^2 $$
What it explains:
Definition: Linear regression with one predictor
Example:
Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (a_0 + a_1 x_{i1}))^2 $$
Best-fit regression line
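Minimizing the sum of squared errors for one predictor has a closed-form solution, sketched below (a Python illustration; the course itself uses R's `lm`). The data values are made up to roughly follow y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Return (a0, a1) minimizing sum((y - (a0 + a1*x))^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # roughly y = 1 + 2x, made-up data
a0, a1 = fit_line(xs, ys)
print(round(a0, 3), round(a1, 3))  # intercept 1.15, slope 1.94
```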
How it works:
AIC applied to regression
Equation:
Example:
Relative likelihood (with $AIC_1$ the lower, i.e. better, score) $$ e^\frac{(AIC_1 - AIC_2)}{2} $$
Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} \approx 8.2\% $$
Result:
Characteristics:
BIC Metrics - Rule of thumb
Baseball Example: Determine average number of runs a home run is worth.
Equation:
$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$
Applications of LR:
Causation: One thing causes another thing.
Correlation: Two things tend to happen (or not happen) together; neither of them might cause the other.
Application in Linear Regression:
How to decide if there is causation?
Method 1: We can adjust the data so the fit is linear.
Method 2: We can transform the response
Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$
Box-Cox transformations (link) can be automated in statistical software.
Example: Variable interaction
Warnings About P-Value
(A bit more explanation on contextualizing the coefficient of 1.0 in this context explained in Piazza):
Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.
So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.
Adjusted R-squared: Adjusts for the number of attributes used.
Warnings about R-squared:
Key Assumptions of Linear Regression
How to calculate R-squared values directly from cross validation
# R-squared = 1 - SSEresiduals/SSEtotal
# total sum of squared differences between data and its mean
SStat <- sum((dat$Crime - mean(dat$Crime))^2)
# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2) #model 1
SSres_model2 <- sum(model2$residuals^2)
SSres_c <- attr(c, "ms")*nrow(dat) # MSE, times number of data points, gives sum of squared error
# Calculate R-squared
1 - SSres_model/SStat # initial model with insignificant predictors
# 0.803
1 - SSres_model2/SStat # model2 without insignificant predictors (based on p-value)
# 0.766
1 - SSres_c/SStat # cross-validated
# 0.638
# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising since we only have 47 data points and 15 predictors (about a 3:1 ratio)
# Good to have 10:1 or more.
Q0.1 - Using the crime dataset, apply PCA and then create a regression model using the first few principal components.
prcomp for PCA (scale data first with scale=TRUE).
Don't forget to unscale the coefficients to make a prediction for the new city (i.e. do the scaling calculation in reverse).
Eigenvalues Correspond to Principal Components
prcomp does all of this under the hood.
How to decide which PC is important
prcomp will rank them for us.
Why: Normality assumption
Box-Cox Transformation
Box-Cox Formula:
NOTE: First check whether you need the transformation (e.g. QQ plot)
Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.
When: Whenever using factor-based model (regression, SVM) to analyze time-series data*
("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)
On What:
How:
Why:
What:
D_1 becomes the new x-coordinate; D_2 becomes the new y-coordinate.
$X$: Initial matrix of data (scaled)
Find all of the eigenvectors of $X^TX$
PCA:
How to use beyond linear transformation
Problem: How do we get the regression coefficients for the original factors instead of PCs?
Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.
$$ \begin{aligned} y_i &= b_0 + \Sigma^L_{k=1} b_k t_{ik} \\ &= b_0 + \Sigma^L_{k=1} b_k [ \Sigma^m_{j=1} x_{ij} v_{jk} ] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [\Sigma^L_{k=1} b_k v_{jk}] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} a_j \end{aligned} $$
Implied regression coefficient for $x_j$: $$ a_j = \Sigma^L_{k=1} b_k v_{jk} $$
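The step a_j = sum_k b_k * v_jk can be sketched directly: given regression coefficients on the retained PCs and the loading matrix, recover coefficients on the original (scaled) factors. The loadings and coefficients below are hypothetical, not from a real PCA.

```python
def implied_coefficients(b, V):
    """b: coefficients b_1..b_L on the PCs; V: m x L loading matrix v_jk."""
    m, L = len(V), len(b)
    return [sum(b[k] * V[j][k] for k in range(L)) for j in range(m)]

# Hypothetical loadings for m=3 original factors and L=2 retained PCs
V = [[0.6, 0.1],
     [0.5, -0.7],
     [0.2, 0.7]]
b = [2.0, 1.0]
print(implied_coefficients(b, V))  # approximately [1.3, 0.3, 1.1]
```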
Given $A$: Square matrix
Points to consider:
Example: Classification
Takeaway:
Types:
Terminology:
Method:
How regression trees are actually calculated
Derivation:
Other uses of trees The model used at each node would be different:
2 things to consider:
Branching Method
(Note, this is one method - there are other ways to do this)
Key Ideas:
How to reject a potential branch
"Overfitting our model can be costly; make sure the benefit of each branch is greater than its cost"
We introduce randomness in 3 steps:
Effect:
Results:
Pros/Cons:
Why explainability matters:
Takeaway: Linear regression models are pretty explainable. Each coefficient is tied to a specific predictor.
Takeaway: Tree-based models are less explainable. Branching logic is often not intuitive. Random forest is even worse.
Definition: Turns regression model into a probability model.
Standard Linear Regression:
$$ y = a_0 + a_1x_1 + \ldots + a_jx_j $$
Logistic Regression: Takes the linear function and puts it into an exponential: $$ p = \frac{1}{1 + e^{-(a_0 + a_1x_1 + \ldots + a_jx_j)}} $$
The linear function can return any number from $-\infty$ to $+\infty$; the logistic function maps it into $(0, 1)$ so it can be read as a probability.
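A minimal sketch of that squashing behavior (Python illustration; weights would come from a fitted model in practice):

```python
import math

def logistic(z):
    """Map any real score z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, a, a0):
    """P(y=1) for the linear score a0 + a.x pushed through the logistic."""
    z = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return logistic(z)

print(logistic(0.0))             # 0.5: a score of 0 is a 50/50 call
print(round(logistic(4.0), 3))   # close to 1
print(round(logistic(-4.0), 3))  # close to 0
```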
Example 1:
Example 2:
Similarities
Differences
Measures of model quality:
Thresholding:
Area Under Curve (AUC) = Probability that the model estimates a random "yes" point higher than a random "no" point.
There are many metrics derived from confusion matrix, main ones to know are:
Example: Spam detection, confusion matrix
Idea: We can assign costs for each factor and calculate the sum.
$$ \text{Total Cost} = TP \cdot c_{TP} + FN \cdot c_{FN} + FP \cdot c_{FP} + TN \cdot c_{TN} $$
Example: Cost of lost productivity
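The weighted-cost idea can be sketched as below: multiply each confusion-matrix cell count by its per-case cost and sum. The counts and costs are made-up numbers for a spam-detection setting (a missed spam is cheap, a lost real message is expensive).

```python
def total_cost(counts, costs):
    """counts/costs: dicts keyed by 'TP', 'FN', 'FP', 'TN'."""
    return sum(counts[k] * costs[k] for k in ("TP", "FN", "FP", "TN"))

counts = {"TP": 90, "FN": 10, "FP": 5, "TN": 895}      # hypothetical run
costs = {"TP": 0.0, "FN": 0.04, "FP": 1.0, "TN": 0.0}  # hypothetical per-case costs
print(total_cost(counts, costs))  # 10*0.04 + 5*1.0, about 5.4
```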
Used when we think the response follows a Poisson distribution.
$$ f(z) = \frac{\lambda^z e^{-\lambda}}{z!} $$
Other splines:
We start with:
Estimate of how the regression coefficients and the random error are distributed.
Example: Predict how tall a child will be as an adult based on:
NOTE: These use regression, but can be applied to any model.
What:
Types:
What: Applies optimization for variable selection.
Types:
Definition: Keep adding "good enough" factors until there's no more to add. Then remove those with high p-value.
Definition: Keep removing "bad" factors until there's no more to remove.
Definition: Combine forward selection / backward elimination.
What: Adds constraint to standard regression equation (or other algos), then optimizes on it.
Math:
Minimize standard regression equation: $$ \min{\Sigma^m_{i=1}} (y_i - (a_0 + a_1x_{i1} + \ldots + a_nx_{in}))^2 $$
Add "budget" $T$ to minimize squared sum of errors. This "constrains" the sum of coefficients from becoming too large.
Tips:
What: Constrain combination of absolute value of coefficients and their squares.
Math (Minimization same as LASSO):
Add constraint for the coefficients and their squares:
$$ \text{Subject To } \lambda\Sigma^n_{j=1} |a_j| + (1-\lambda)\Sigma^n_{j=1}a_j^2 \le \tau $$
Math (Minimization same as LASSO):
$$ \text{Subject To } \Sigma^n_{j=1} a_j^2 \le \tau $$
Bias:
Variance:
The goal is to find the "sweet spot" where total error is at its minimum and underfit/overfit errors are both small.
Motivation: RR reduces overfitting (variance) by scaling down the magnitude of each coefficient.
Application: Use when model is performing worse on validation data than training data.
Motivation: How to decide which VS model to choose?
Classic options (forward/backward/stepwise):
Optimization options (LASSO, Elastic net)
Motivation:
Definition: When designing experiments, we need to control extraneous factors.
Definition: Factors that create variation.
Example: Red car vs Blue cars
Definition: Create control and treatment group(s), test which of two alternatives is better.
Three things that need to be true:
Motivation: Understand importance of all factors between A/B test.
Definition: Test the effectiveness of every combination.
Independent Factors: Test subset of combos, then use regression to estimate effects.
"Exploration vs Exploitation"
Definition:
Method:
K alternatives; after each of N tests, update the probability of each alternative being the best.
Parameters:
Summary:
Definition:
Probability mass function (PMF)*
$$ P(X=1) = p \\ P(X=0) = 1-p $$
(The PMF represents the probability associated with each possible value of the random variable.)
Example: Charity donations
Definition: Probability of getting $x$ successes out of $n$ independent identically distributed Bernoulli ($p$) trials.
Probability Mass Function: $$ P(X=x) = \binom{n}{x}p^x(1-p)^{n-x} = \frac{n!}{(n-x)!x!}p^x(1-p)^{n-x} $$
Graph:
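The binomial PMF above can be checked with a few lines of Python (`math.comb` gives the binomial coefficient):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example: probability of exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))  # 6 * 0.25 * 0.25 = 0.375
print(sum(binomial_pmf(x, 4, 0.5) for x in range(5)))  # PMF sums to 1
```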
Definition: Probability of having $x$ Bernoulli($p$) failures until first success.
Examples:
Probability Mass Function
$$ P(X=x) = (1-p)^xp $$
Graph:
Case Study: Assuming each Bernoulli trial is i.i.d., can we compare the data to a Geometric distribution to test whether the i.i.d. assumption is true?
Probability Mass Function: $$ f_X(x)=\frac{\lambda^x e^{-\lambda}}{x!} $$
Chart:
Relation to Poisson:
Probability Density Function (the exponential is continuous, so it has a density rather than a mass function) $$ f_X(x) = \lambda e^{-\lambda x} $$
Chart:
Case study of relation between Poisson and exponential (security line delays at airport):
Probability Density Function:
Intuition of $k$:
Relation between Weibull / Poisson:
Motivation: Even if 2 sets of data have different scale of data / number of data points, 2 similar distributions will have about the same value at each quantile.
Use:
Note: Statistical tests can do the same, but sometimes visual representation is easier to understand.
Good fit vs Bad fit:
Misleading distribution caught by QQplot:
A more complex example: Adding 2 employees to the call center and only hold 4 customers in the queue.
Definition: Doesn't matter what has happened in the past, what matters is where we are now.
Memoryless Exponential distribution: Distribution of remaining time = initial distribution of time.
Memoryless Poisson distribution: Distribution of time to next arrival = initial distribution of time to next arrival.
Works the other way too:
Potential model params:
Other factors that Kendall notation doesn't include:
Note that Simulation can be used instead for cases where the mathematical model gets too complex (too many factors).
Definition: Build a model of something to study and analyze its behavior.
Examples:
Deterministic: No randomness. Same inputs give same outputs.
Stochastic: Use when the system has randomness. Focus of this lesson.
Continuous-time: Changes happen continuously.
Discrete-event: Changes happen at discrete time points.
Components:
Under the hood: Uses (pseudo) random number generation, tracking, animation, etc
Replications: number of runs of simulation
Validation: Use real data to validate your simulation is giving reasonable results.
When to use simulation: Asking "what-if" questions
Steps:
Considerations:
Problem: Simulations can be validated by comparing them to data that already exists. However, experiments often simulate something that doesn't exist in reality. How do we validate then?
Problem 1: Matching disease prevalence of two diseases after simulation period did not work, because:
Solution 1: Use multiple measures of disease spread (not just one)
Problem 2: None of the metrics matched, even though the medical parameters followed the literature.
Solution 2: Run 2 simulations
Definition: Models transition probabilities between states of a memoryless system.
For each state $i$ in the model,
Definition: Long-term probability that the system will be in each state.
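One way to see the steady state is to apply the transition matrix repeatedly until the state distribution stops changing (a minimal sketch; the 2-state transition matrix below is made up, with rows summing to 1).

```python
def steady_state(P, iters=200):
    """Approximate the steady-state distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical 2-state chain: state 0 is "sticky", state 1 leaks back quickly
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = steady_state(P)
print([round(p, 4) for p in pi])  # converges toward [5/6, 1/6]
```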
Markov chains are memoryless. Biggest limitation.
Can be used to model things that are not memoryless in the short run when that doesn't matter for the long-run system (e.g. urban sprawl, population dynamics, disease propagation).
Some values are more likely to be missing than others, which leads to bias in the missing data.
Examples:
With categorical data, we can simply add an extra "MISSING" category to the attribute.
Trickier with continuous data, some options:
If we assign 0 to missing data, it might pull coefficients one way or another to account for the missing data.
Definition: Estimate missing values
Midrange Value: Choose mean, median (numeric) or mode (categorical)
Regression: Reduce or eliminate the problem of bias.
Perturbation: Add perturbation to each imputed value.
Imputation approaches:
Components:
Definition: Decisions that the optimization model will find the best values for.
Example: Political candidate scheduling
Definition: Restrictions on the decisions that can be made.
Definition: A measure of the quality of a solution, i.e. the function of the variables that we're trying to maximize or minimize.
Example: Maximize expected new votes.
Feasible Solution: Variable values that satisfy all constraints.
Optimal solution: Feasible solution with the best objective value.
Problem: Satisfy soldiers' nutritional requirements at minimum cost.
Definitions:
Optimization model components:
Problem: Meet forecasted demand $d_i$ for each day of the week $i$
Wrong approach:
Right approach: Account for consecutive working days. This way, the optimization already accounts for union rules.
Idea: Binary variables allow more complex models.
Slide 1:
Slide 2:
Slide 3:
Violating the Constraints:
Let $y_i$ and $y_j$ both be binary variables, and let $x_i$ be a continuous variable. Which expression is equivalent to the constraint "If $y_i=1$, then $y_j$ must also be 1"?
A: $y_j \ge y_i$
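The answer can be checked with a truth table: for binary variables, y_j >= y_i forbids exactly one assignment (y_i = 1, y_j = 0), which is exactly what the implication forbids.

```python
def implication_holds(y_i, y_j):
    """'If y_i = 1 then y_j = 1' as a logical statement."""
    return (y_i == 0) or (y_j == 1)

def constraint_holds(y_i, y_j):
    """The linear constraint y_j >= y_i."""
    return y_j >= y_i

for y_i in (0, 1):
    for y_j in (0, 1):
        assert implication_holds(y_i, y_j) == constraint_holds(y_i, y_j)
        print(y_i, y_j, constraint_holds(y_i, y_j))
```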